I. Background
Employee turnover represents one of the most significant challenges facing organizations, with far-reaching implications for productivity, morale, and bottom-line performance. This case study examines the employee retention challenges at Salifort Motors, where rising departure rates have prompted leadership to conduct a comprehensive analysis of the underlying factors driving talent attrition. Understanding why valued team members choose to leave is crucial not only for reducing costly turnover and recruitment expenses, but also for creating a more engaging, supportive work environment that attracts and retains top talent in an increasingly competitive automotive industry landscape.
The dataset contains a mix of numerical and categorical features. Key variables include satisfaction level, evaluation scores, project count, and working hours - all potential indicators of employee satisfaction and workload. The target variable 'left' indicates whether an employee has left the company (1) or not (0).
This problem is a binary classification task where we need to predict whether an employee will leave (1) or stay (0). The business context is important - employee turnover is costly, so identifying factors that contribute to employees leaving can help HR develop targeted retention strategies.
We'll follow a standard data science workflow: data loading, exploration, preprocessing, model building, and evaluation. We'll use pandas and numpy for data manipulation, matplotlib and seaborn for visualization, and scikit-learn for machine learning models.
II. Data Preparation - Inspect and Clean
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, auc, roc_curve, roc_auc_score
# Load dataset into a dataframe and display the first few rows of the dataframe
df = pd.read_csv("HR_dataset.csv")
df.head(5)
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64
 3   average_montly_hours   14999 non-null  int64
 4   time_spend_company     14999 non-null  int64
 5   Work_accident          14999 non-null  int64
 6   left                   14999 non-null  int64
 7   promotion_last_5years  14999 non-null  int64
 8   Department             14999 non-null  object
 9   salary                 14999 non-null  object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
df.describe(include='all')
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999 | 14999 |
unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 10 | 3 |
top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | sales | low |
freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4140 | 7316 |
mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.144610 | 0.238083 | 0.021268 | NaN | NaN |
std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 | NaN | NaN |
min | 0.090000 | 0.360000 | 2.000000 | 96.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN |
25% | 0.440000 | 0.560000 | 3.000000 | 156.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN |
50% | 0.640000 | 0.720000 | 4.000000 | 200.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN |
75% | 0.820000 | 0.870000 | 5.000000 | 245.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN |
max | 1.000000 | 1.000000 | 7.000000 | 310.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 | NaN | NaN |
Almost a quarter of employees surveyed left the company.
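As a quick check on that figure, the share of leavers can be read directly from the target column (a minimal snippet; exact output not reproduced here):
# Proportion of employees who left (1) vs. stayed (0)
df['left'].value_counts(normalize=True)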
# Display all column names
df.columns
Index(['satisfaction_level', 'last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company', 'Work_accident', 'left', 'promotion_last_5years', 'Department', 'salary'], dtype='object')
Some inconsistencies in the column names exist. We will standardize the column names so that they are all in snake_case, correct any column names that are misspelled, and make column names more concise.
# Rename columns as needed
rename_dict = {'Department':'department', 'Work_accident':'work_accident', 'promotion_last_5years':'promotion_last_5_years', 'average_montly_hours' : 'average_monthly_hours'}
df.rename(mapper=rename_dict, axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   satisfaction_level      14999 non-null  float64
 1   last_evaluation         14999 non-null  float64
 2   number_project          14999 non-null  int64
 3   average_monthly_hours   14999 non-null  int64
 4   time_spend_company      14999 non-null  int64
 5   work_accident           14999 non-null  int64
 6   left                    14999 non-null  int64
 7   promotion_last_5_years  14999 non-null  int64
 8   department              14999 non-null  object
 9   salary                  14999 non-null  object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
# Change work_accident, left and promotion_last_5_years to categorical for purposes of data exploration
df['work_accident'] = df['work_accident'].astype('category')
df['left'] = df['left'].astype('category')
df['promotion_last_5_years'] = df['promotion_last_5_years'].astype('category')
Check for any missing values in the data.
df.isna().sum()
satisfaction_level        0
last_evaluation           0
number_project            0
average_monthly_hours     0
time_spend_company        0
work_accident             0
left                      0
promotion_last_5_years    0
department                0
salary                    0
dtype: int64
Check for any duplicate entries in the data.
df_dups = df[df.duplicated(keep=False)]
df_dups.sort_values(by = ['satisfaction_level','last_evaluation','time_spend_company', 'department']).head(6)
| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | time_spend_company | work_accident | left | promotion_last_5_years | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
30 | 0.09 | 0.62 | 6 | 294 | 4 | 0 | 1 | 0 | accounting | low |
12030 | 0.09 | 0.62 | 6 | 294 | 4 | 0 | 1 | 0 | accounting | low |
14241 | 0.09 | 0.62 | 6 | 294 | 4 | 0 | 1 | 0 | accounting | low |
71 | 0.09 | 0.77 | 5 | 275 | 4 | 0 | 1 | 0 | product_mng | medium |
12071 | 0.09 | 0.77 | 5 | 275 | 4 | 0 | 1 | 0 | product_mng | medium |
14282 | 0.09 | 0.77 | 5 | 275 | 4 | 0 | 1 | 0 | product_mng | medium |
Although it is possible for a handful of employees to share similar characteristics, it is unlikely that this many records (over 5,000) are genuinely distinct observations that just happen to match exactly on every feature. Moreover, the duplicates show a systematic pattern: the gap between the first and second occurrence is roughly 12,000 rows, and between the second and third about 2,211 rows. We will proceed to remove the duplicate entries.
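Before dropping anything, it is worth confirming how many rows drop_duplicates() will actually discard (a small sanity check; the exact count is not shown here):
# Number of rows flagged as duplicates of an earlier row (these are the rows that will be removed)
df.duplicated().sum()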
# Drop duplicate rows in place
df.drop_duplicates(inplace=True)
# Display first few rows of new dataframe
df.head()
| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | time_spend_company | work_accident | left | promotion_last_5_years | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
Examining outliers
def calculate_outlier_table(df, columns):
"""Return a DataFrame summarizing outlier percentages for specified columns.
Args:
df (pd.DataFrame): The DataFrame containing the data.
columns (list of str): List of column names to check for outliers.
Returns:
pd.DataFrame: Table with outlier percentages for each column.
"""
results = []
for col in columns:
if col in df.columns:
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
total = len(df)
left_outliers = df[df[col] < lower_bound]
right_outliers = df[df[col] > upper_bound]
left_pct = (len(left_outliers) / total) * 100 if total > 0 else 0
right_pct = (len(right_outliers) / total) * 100 if total > 0 else 0
total_pct = left_pct + right_pct
results.append({
'column': col,
'outlier_percentage': total_pct,
'left_outlier_percentage': left_pct,
'right_outlier_percentage': right_pct
})
else:
results.append({
'column': col,
'outlier_percentage': None,
'left_outlier_percentage': None,
'right_outlier_percentage': None
})
return pd.DataFrame(results)
numerical_features = []
categorical_features = []
for col in df.columns:
if (df[col].dtype == 'int64') or (df[col].dtype == 'float64'):
numerical_features.append(col)
else:
categorical_features.append(col)
print(calculate_outlier_table(df, numerical_features))
                  column  outlier_percentage  left_outlier_percentage  right_outlier_percentage
0     satisfaction_level            0.000000                      0.0                  0.000000
1        last_evaluation            0.000000                      0.0                  0.000000
2         number_project            0.000000                      0.0                  0.000000
3  average_monthly_hours            0.000000                      0.0                  0.000000
4     time_spend_company            6.871821                      0.0                  6.871821
# Create a boxplot to visualize the distribution of `time_spend_company` (tenure) and detect any outliers
sns.boxplot(data=df, x='time_spend_company')
plt.show()
sns.boxplot(data=df, x='time_spend_company', y='department')
plt.show()
sns.histplot(data=df, x='time_spend_company', bins=4, binrange=(2,10), hue='left')
<Axes: xlabel='time_spend_company', ylabel='Count'>
len(df[df.time_spend_company >= 4]) / len(df)
0.32449337002752066
III. Explore Data and Relationships
Taking a cross-section of those who left versus those who stayed, we can see that, not surprisingly, employees who left were more likely to report low satisfaction levels when surveyed. We can also see that employees who left worked roughly 10 more hours per month, on average, than those who stayed. This excess in hours is likely one of the main factors driving employees' decisions to leave.
df.groupby('left', observed=False).mean(numeric_only=True)
left | satisfaction_level | last_evaluation | number_project | average_monthly_hours | time_spend_company
---|---|---|---|---|---
0 | 0.667365 | 0.715667 | 3.786800 | 198.94270 | 3.262000 |
1 | 0.440271 | 0.721783 | 3.883476 | 208.16223 | 3.881467 |
df.groupby('department').mean(numeric_only=True)
department | satisfaction_level | last_evaluation | number_project | average_monthly_hours | time_spend_company
---|---|---|---|---|---
IT | 0.634016 | 0.715051 | 3.797131 | 200.638320 | 3.350410 |
RandD | 0.627176 | 0.712983 | 3.850144 | 201.291066 | 3.319885 |
accounting | 0.607939 | 0.721900 | 3.834138 | 200.877617 | 3.404187 |
hr | 0.621947 | 0.715691 | 3.675541 | 199.371048 | 3.256240 |
management | 0.631995 | 0.726307 | 3.837156 | 201.529817 | 3.981651 |
marketing | 0.634770 | 0.718440 | 3.720654 | 199.487370 | 3.421991 |
product_mng | 0.629825 | 0.713790 | 3.794461 | 198.893586 | 3.341108 |
sales | 0.631349 | 0.710398 | 3.777092 | 200.242050 | 3.380673 |
support | 0.634822 | 0.722998 | 3.820977 | 200.627128 | 3.292696 |
technical | 0.627937 | 0.719791 | 3.859180 | 201.115419 | 3.309269 |
Plotting the distribution of average monthly hours for those who left and those who stayed on the same graph, we can see that employees who left fall into two groups: those who worked less than a standard work month of 160 hours (4 weeks x 40 hours) and those who significantly exceeded it (i.e., 200+ hours).
sns.histplot(data=df, x='average_monthly_hours', hue='left')
<Axes: xlabel='average_monthly_hours', ylabel='Count'>
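To put rough numbers on that split, the share of leavers in each group can be computed directly; the 160- and 200-hour cut points below are illustrative assumptions based on the plot rather than values established above.
# Share of leavers below a standard work month and well above it (thresholds are assumptions)
leavers = df[df['left'] == 1]
print(f"Leavers under 160 hours/month: {(leavers['average_monthly_hours'] < 160).mean():.1%}")
print(f"Leavers over 200 hours/month: {(leavers['average_monthly_hours'] > 200).mean():.1%}")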
Data visualizations
def plot_feature_distributions(data, target, features):
"""
Plots boxplot and histplot side by side for each feature in the given list.
Parameters:
data (pd.DataFrame): The DataFrame containing the features.
target (str): Name of the target column used to color (hue) the boxplots.
features (list): List of feature names to plot.
"""
num_features = len(features)
# Set figure size dynamically based on the number of features
fig, axes = plt.subplots(num_features, 2, figsize=(12, 4 * num_features))
if num_features == 1:
axes = [axes] # Ensure axes is iterable when there's only one feature
for i, feature in enumerate(features):
sns.boxplot(x=data[feature], hue=data[target], ax=axes[i][0])
axes[i][0].set_title(f'Boxplot of {feature}')
sns.histplot(data[feature], kde=True, ax=axes[i][1])
axes[i][1].set_title(f'Histogram of {feature}')
plt.tight_layout()
plt.show()
def plot_cat_features(data, target, features):
"""
Plots a barplot of the target against each feature and a histplot side by side for each feature in the given list.
Parameters:
data (pd.DataFrame): The DataFrame containing the features.
target (str): Name of the target column shown in the barplots.
features (list): List of feature names to plot.
"""
num_features = len(features)
# Set figure size dynamically based on the number of features
fig, axes = plt.subplots(num_features, 2, figsize=(12, 4 * num_features))
if num_features == 1:
axes = [axes] # Ensure axes is iterable when there's only one feature
for i, feature in enumerate(features):
sns.barplot(x=data[feature], y=data[target], ax=axes[i][0])
axes[i][0].set_title(f'Barplot of {feature}')
sns.histplot(data[feature], kde=False, ax=axes[i][1])
axes[i][1].set_title(f'Histogram of {feature}')
plt.tight_layout()
plt.show()
plot_feature_distributions(df, 'left', numerical_features)
A review of the boxplots reinforces the observations from the cross-section analysis above. We also see some evidence that employees with roughly 4-5 years of tenure may be more likely to leave, possibly because of a lack, or perceived lack, of advancement opportunities.
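One way to test that impression is to compute the observed attrition rate at each tenure level (a short sketch; `left` is cast back to an integer because it was converted to a categorical earlier):
# Attrition rate (share of leavers) by years at the company
df.groupby('time_spend_company')['left'].apply(lambda s: s.astype(int).mean())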
plot_cat_features(df, 'left', categorical_features)
From the barplots above, there appears to be some differentiation in the likelihood of leaving across departments and salary groups. Employees in the management and R&D departments are less likely to leave than employees in other departments. Not surprisingly, as salary increases, the likelihood that an employee separates from the company declines.
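The same idea can be applied to the categorical features by computing attrition rates per department and per salary band (a brief sketch; output omitted):
# Attrition rate by department and by salary band
left_flag = df['left'].astype(int)
print(left_flag.groupby(df['department']).mean().sort_values())
print(left_flag.groupby(df['salary']).mean().sort_values())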
# Plot a correlation heatmap
plt.figure(figsize=(16, 9))
heatmap = sns.heatmap(df.corr(numeric_only=True), vmin=-1, vmax=1, annot=True, cmap=sns.color_palette("vlag", as_cmap=True))
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':14}, pad=12);
The correlation heatmap provides additional support for a relationship between leaving and both satisfaction level and time spent at the company. We also gain some insight into what may be driving satisfaction levels, as there is a slight negative correlation between satisfaction level and both the number of projects and time spent at the company. This may mean that one group of employees feels they have too many (or too few) projects, while another group feels they have outgrown their current role and are neither challenged nor rewarded.
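Because left was converted to a categorical dtype earlier, it may be excluded from the numeric-only heatmap; the correlations with the target can be computed explicitly by casting it back to an integer (a small sketch):
# Correlation of each numeric feature with the target
corr_with_left = df.assign(left=df['left'].astype(int)).corr(numeric_only=True)['left']
print(corr_with_left.drop('left').sort_values())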
IV. Modeling
We will begin by preparing the training and testing data, fitting several baseline linear and non-linear models, and comparing their performance. Specifically, we will look at logistic regression, a decision tree classifier, a random forest classifier, and a gradient boosting classifier.
df1 = pd.get_dummies(df, columns=['salary' ,'department'], drop_first=True)
X = df1.drop(['left'], axis=1)
y = df1['left']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, stratify=y, random_state=11)
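A quick check that the stratified split preserved the class balance in both partitions (output not shown):
# Class balance should be nearly identical in the training and test sets
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))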
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numerical = ['satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'time_spend_company']
X_train_scaled = scaler.fit_transform(X_train[numerical])
X_train_scaled = pd.DataFrame(X_train_scaled, columns=numerical, index=X_train.index)
X_train_scaled = pd.concat([X_train_scaled, X_train[X_train.columns[~X_train.columns.isin(numerical)]]], axis=1)
X_train_scaled.head(5)
| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | time_spend_company | work_accident | promotion_last_5_years | salary_low | salary_medium | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3670 | -0.586807 | 1.204042 | 1.031037 | -0.500513 | -0.272195 | 0 | 0 | True | False | False | False | False | False | True | False | False | False | False |
892 | 0.449539 | 1.619914 | 0.170436 | 1.551623 | 1.210005 | 0 | 0 | True | False | False | False | False | False | False | False | False | True | False |
5262 | 0.200816 | 1.501093 | 1.031037 | 0.135649 | 0.468905 | 1 | 0 | True | False | True | False | False | False | False | False | False | False | False |
8218 | -1.830423 | 0.788170 | 1.891639 | 0.197213 | 0.468905 | 0 | 0 | False | True | False | False | False | False | False | False | False | True | False |
6934 | 0.698262 | -0.043575 | 0.170436 | -0.008000 | -1.013295 | 0 | 0 | True | False | False | False | False | False | False | False | False | False | False |
# Apply the scaler fitted on the training data to the test data (do not refit on the test set)
X_test_scaled = scaler.transform(X_test[numerical])
X_test_scaled = pd.DataFrame(X_test_scaled, columns=numerical, index=X_test.index)
X_test_scaled = pd.concat([X_test_scaled, X_test[X_test.columns[~X_test.columns.isin(numerical)]]], axis=1)
X_test_scaled.head(5)
| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | time_spend_company | work_accident | promotion_last_5_years | salary_low | salary_medium | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7750 | -0.016480 | -1.275368 | 0.166662 | -0.055929 | -0.281343 | 1 | 0 | True | False | False | False | False | False | False | False | False | False | True |
10014 | 1.522283 | -0.740813 | 1.023698 | 1.278413 | -0.281343 | 0 | 0 | False | False | False | False | False | False | False | False | True | False | False |
8462 | -0.266009 | 1.694382 | -0.690374 | 1.463168 | -1.068158 | 0 | 0 | False | True | False | False | False | False | False | False | False | False | False |
4562 | -0.224421 | 0.268902 | -0.690374 | -0.117514 | -0.281343 | 0 | 0 | False | True | False | False | False | False | False | False | False | False | True |
5090 | 0.732108 | -0.800208 | -0.690374 | -0.076457 | -0.281343 | 0 | 0 | False | True | False | False | False | False | False | False | False | True | False |
Explore Baseline Models
lg = LogisticRegression(random_state=42)
tree = DecisionTreeClassifier(random_state=42)
rf = RandomForestClassifier(random_state=12)
gb = GradientBoostingClassifier(random_state=12)
models = [lg, tree, rf, gb]
num_models = len(models)
cols = 2 # Number of columns in the grid
rows = math.ceil(num_models / cols)
fig, axes = plt.subplots(rows, cols, figsize=(15, 5*rows))
axes = axes.flatten()
for i, model in enumerate(models):
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot(ax=axes[i])
axes[i].set_title(f"Confusion Matrix for {model.__class__.__name__}")
# Remove unused subplots
for j in range(i+1, len(axes)):
fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
def evaluate_models(models, X_test, y_test):
"""
Evaluates multiple trained models on test data and prints a comparison table.
Parameters:
models (dict): A dictionary where keys are model names and values are trained sklearn models.
X_test (pd.DataFrame or np.array): Test feature set.
y_test (pd.Series or np.array): True labels for the test set.
"""
results = []
for name, model in models.items():
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='binary')
recall = recall_score(y_test, y_pred, average='binary')
f1 = f1_score(y_test, y_pred, average='binary')
roc_auc = roc_auc_score(y_test, y_prob) if y_prob is not None else "N/A"
results.append([name, accuracy, precision, recall, f1, roc_auc])
# Create DataFrame for better visualization
results_df = pd.DataFrame(results, columns=["Model", "Accuracy", "Precision", "Recall", "F1-Score", "ROC-AUC"])
print(results_df)
# Usage:
trained_models = {
"Decision Tree": tree,
"Random Forest": rf,
"Logistic Regression": lg,
"GradientBoost": gb
}
evaluate_models(trained_models, X_test_scaled, y_test)
                 Model  Accuracy  Precision    Recall  F1-Score   ROC-AUC
0        Decision Tree  0.960640   0.892562  0.867470  0.879837  0.923335
1        Random Forest  0.976651   0.981982  0.875502  0.925690  0.984909
2  Logistic Regression  0.828552   0.466942  0.226908  0.305405  0.810240
3        GradientBoost  0.973983   0.956522  0.883534  0.918580  0.987375
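Since roc_curve and auc were imported at the top but not used yet, plotting the ROC curves gives another view of the same comparison. This is a sketch that reuses the trained_models dictionary defined above.
# Plot ROC curves for each baseline model on the test set
fig, ax = plt.subplots(figsize=(8, 6))
for name, model in trained_models.items():
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test_scaled)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_prob)
        ax.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
ax.plot([0, 1], [0, 1], linestyle='--', color='grey', label='Chance')
ax.set_xlabel('False positive rate')
ax.set_ylabel('True positive rate')
ax.set_title('ROC Curves for Baseline Models')
ax.legend()
plt.show()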
Fine-Tuning Parameters
The baseline model results show that the tree-based models (decision tree, random forest, and gradient boosting) significantly outperformed logistic regression, achieving higher accuracy, precision, recall, and F1-scores. This suggests that the relationships between the features and employee turnover are complex and non-linear. We will take a closer look at the random forest and gradient boosting models since they exhibit similar performance, fine-tune both, and examine which features each model considers important when making predictions.
Building a Random Forest Model
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV
params = {
'n_estimators': [100, 150, 200, 250, 300],
'max_depth': [3, 5, 7, 13],
}
rf = RandomForestClassifier(random_state=42)
rf_grid = GridSearchCV(
estimator=rf,
param_grid=params,
scoring=['roc_auc', 'accuracy', 'f1'],
refit='f1',
cv=5,
verbose=0
)
rf_grid.fit(X_train_scaled, y_train)
print("")
print(f'The best random forest model parameters: {rf_grid.best_params_}, with a best cross-validation F1 score of {rf_grid.best_score_}')
print("")
# Get predictions and confusion matrix
y_pred = rf_grid.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_pred, labels=rf_grid.classes_)
# Get top 10 feature importances
feat_impt = rf_grid.best_estimator_.feature_importances_
ind = np.argpartition(feat_impt, -10)[-10:]
feat = X_train_scaled.columns[ind]
feat_impt_top = feat_impt[ind]
feat_df_sorted = pd.DataFrame({"Feature": feat, "Importance": feat_impt_top}).sort_values("Importance")
# Create side-by-side plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=rf_grid.classes_)
disp.plot(ax=ax1)
ax1.set_title("Confusion Matrix")
# Feature importance
feat_df_sorted.plot(kind='barh', x="Feature", y="Importance", ax=ax2, legend=False)
ax2.set_title("Random Forest: Feature Importances for Employee Leaving")
ax2.set_xlabel("Importance")
ax2.set_ylabel("Feature")
plt.tight_layout()
plt.show()
The best random forest model parameters: {'max_depth': 13, 'n_estimators': 100}, with a best cross-validation F1 score of 0.9447511631209704
Building a Gradient Boosting Model
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
params = {
'n_estimators': [100, 150, 200],
'max_depth': [3, 5, 7, 13],
'learning_rate': [1, .1, .01],
'min_samples_leaf': [.9, 1]
}
gb = GradientBoostingClassifier(random_state=42)
gb_grid = GridSearchCV(
estimator=gb,
param_grid=params,
scoring=['roc_auc', 'accuracy', 'f1'],
refit='f1',
cv=5,
verbose=0
)
gb_grid.fit(X_train_scaled, y_train)
print("")
print(f'The best gradient boosting model parameters: {gb_grid.best_params_}, with a best cross-validation F1 score of {gb_grid.best_score_}')
print("")
# Get predictions and confusion matrix
y_pred = gb_grid.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_pred, labels=gb_grid.classes_)
# Get top 10 feature importances
feat_impt = gb_grid.best_estimator_.feature_importances_
ind = np.argpartition(feat_impt, -10)[-10:]
feat = X_train_scaled.columns[ind]
feat_impt_top = feat_impt[ind]
feat_df_sorted = pd.DataFrame({"Feature": feat, "Importance": feat_impt_top}).sort_values("Importance")
# Create side-by-side plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=gb_grid.classes_)
disp.plot(ax=ax1)
ax1.set_title("Confusion Matrix")
# Feature importance
feat_df_sorted.plot(kind='barh', x="Feature", y="Importance", ax=ax2, legend=False)
ax2.set_title("Gradient Boosting: Feature Importances for Employee Leaving")
ax2.set_xlabel("Importance")
ax2.set_ylabel("Feature")
plt.tight_layout()
plt.show()
The best gradient boosting model parameters: {'learning_rate': 0.01, 'max_depth': 7, 'min_samples_leaf': 1, 'n_estimators': 100}, with a best cross-validation F1 score of 0.9461263304457613
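To put the tuned models on the same footing as the baselines, the evaluate_models helper defined earlier can be reused on the refit best estimators (a short sketch; output omitted):
# Compare the tuned models on the held-out test set
tuned_models = {
    "Tuned Random Forest": rf_grid.best_estimator_,
    "Tuned Gradient Boosting": gb_grid.best_estimator_,
}
evaluate_models(tuned_models, X_test_scaled, y_test)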
V. Conclusion, Recommendations, Next Steps
The feature importances extracted from the models suggest that a major contributor to employees leaving is a heavy workload, as both the number of projects and the average monthly hours worked rank highly in both models.
To retain employees, the following recommendations could be presented to the stakeholders:
- Cap the number of projects that employees can work on.
- Consider promoting employees who have been with the company for at least four years, or conduct further investigation about why four-year tenured employees are so dissatisfied.
- Either reward employees for working longer hours, or don't require them to do so.
- If employees aren't familiar with the company's overtime pay policies, inform them about this. If the expectations around workload and time off aren't explicit, make them clear.
- Hold company-wide and within-team discussions to understand and address the company work culture, across the board and in specific contexts.
- High evaluation scores should not be reserved for employees who work 200+ hours per month. Consider a proportionate scale for rewarding employees who contribute more/put in more effort.
Next Steps
It may be justified to still have some concern about data leakage. It could be prudent to consider how predictions change when last_evaluation is removed from the data. It is possible that evaluations are not performed very frequently, in which case it would be useful to be able to predict employee retention without this feature. It is also possible that the evaluation score determines whether an employee leaves or stays, in which case it could be useful to pivot and try to predict performance score. The same could be said for satisfaction score.
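As a first pass at the last_evaluation check, one could refit the tuned random forest without that column and compare test-set performance (a minimal sketch under the assumption that rf_grid and the scaled data from above are still in memory; this is not a full re-tuning):
# Sensitivity check: drop `last_evaluation`, refit with the best parameters found above,
# and compare the resulting F1 score against the tuned model's score
drop_col = 'last_evaluation'
rf_no_eval = RandomForestClassifier(**rf_grid.best_params_, random_state=42)
rf_no_eval.fit(X_train_scaled.drop(columns=[drop_col]), y_train)
y_pred_no_eval = rf_no_eval.predict(X_test_scaled.drop(columns=[drop_col]))
print(f"F1 score without {drop_col}: {f1_score(y_test, y_pred_no_eval):.3f}")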